Credit risk analysis plays a crucial role in the financial industry, enabling lenders to assess the creditworthiness of potential borrowers and make informed decisions about lending. With the increasing availability of data and advancements in machine learning techniques, credit risk analysis has seen significant improvements in accuracy and efficiency.
In this Jupyter Notebook, we will explore the process of credit risk analysis using real-world credit data. Our goal is to build a predictive model that can classify borrowers into risky and not-risky categories, helping financial institutions minimize losses and maximize profitability.
The dataset used in this analysis contains information about various borrowers, including their age, income, loan intent, loan amount, and previous credit history. Additionally, it includes the loan grade, which indicates the level of risk associated with each loan application (ranging from "A" for low risk to "G" for high risk), among other features.
| feature | description |
|---|---|
| person_age | The person's age in years. |
| person_income | The person's annual income. |
| person_home_ownership | The type of home ownership (RENT, OWN, MORTGAGE, OTHER). |
| person_emp_length | The person's employment length in years. |
| loan_intent | The person's intent for the loan (PERSONAL, EDUCATION, MEDICAL, VENTURE, HOMEIMPROVEMENT, DEBTCONSOLIDATION). |
| loan_grade | The risk grade of the loan (A, B, C, D, E, F, G), where A is the least risky and G the most risky. |
| loan_amnt | The loan amount. |
| loan_int_rate | The loan interest rate (between 6% and 21%). |
| loan_status | Whether the loan is currently in default, with 1 being default and 0 being non-default. |
| loan_percent_income | The percentage of the person's income dedicated to the loan payment. |
| cb_person_default_on_file | Whether the person has a default history (YES, NO). |
| cb_person_cred_hist_length | The length of the person's credit history in years. |
1. Exploratory Data Analysis (EDA): Through EDA, we will gain insights into the distribution of various features, explore correlations, and identify potential patterns or trends.
2. Data Preprocessing: We will clean and preprocess the data to handle missing values, encode categorical variables, and prepare the data for modeling.
3. Feature Selection: To build an effective credit risk model, we will select relevant features and examine their impact on the target variable.
4. Model Building: Using machine learning algorithms such as XGBoost, Random Forest, and Logistic Regression, we will train predictive models to classify borrowers as low-risk or high-risk.
5. Hyperparameter Tuning: Fine-tuning the models' hyperparameters will help optimize their performance and produce more accurate predictions.
6. Model Evaluation: We will evaluate the performance of each model using appropriate metrics, such as accuracy, precision, recall, and F1 score.
7. Credit Risk Prediction: Using the selected model, we will predict the credit risk of new loan applicants and classify them into appropriate risk categories.
8. Conclusion: Finally, we will summarize our findings, discuss the model's effectiveness, and provide recommendations for future improvements.
## Basic Libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
import joblib
warnings.filterwarnings("ignore")
%matplotlib inline
## For making sample data:
from sklearn.datasets import make_classification
## For Preprocessing:
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, cross_val_score, RepeatedKFold,GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
## Using imblearn library:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
## Using msno Library for Missing Value analysis:
import missingno as msno
## For Metrics:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import learning_curve
from sklearn.metrics import f1_score
from sklearn.metrics import mean_squared_error
## For Machine Learning Models:
from sklearn.linear_model import LogisticRegression,LinearRegression
from sklearn.neighbors import KNeighborsClassifier,KNeighborsRegressor
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier,RandomForestRegressor
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
#for pickling
import pickle
## Setting the seed to allow reproducibility
np.random.seed(31415)
df = pd.read_csv("./credit_risk_dataset.csv")
df.head(10)
| person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_status | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 59000 | RENT | 123.0 | PERSONAL | D | 35000 | 16.02 | 1 | 0.59 | Y | 3 |
| 1 | 21 | 9600 | OWN | 5.0 | EDUCATION | B | 1000 | 11.14 | 0 | 0.10 | N | 2 |
| 2 | 25 | 9600 | MORTGAGE | 1.0 | MEDICAL | C | 5500 | 12.87 | 1 | 0.57 | N | 3 |
| 3 | 23 | 65500 | RENT | 4.0 | MEDICAL | C | 35000 | 15.23 | 1 | 0.53 | N | 2 |
| 4 | 24 | 54400 | RENT | 8.0 | MEDICAL | C | 35000 | 14.27 | 1 | 0.55 | Y | 4 |
| 5 | 21 | 9900 | OWN | 2.0 | VENTURE | A | 2500 | 7.14 | 1 | 0.25 | N | 2 |
| 6 | 26 | 77100 | RENT | 8.0 | EDUCATION | B | 35000 | 12.42 | 1 | 0.45 | N | 3 |
| 7 | 24 | 78956 | RENT | 5.0 | MEDICAL | B | 35000 | 11.11 | 1 | 0.44 | N | 4 |
| 8 | 24 | 83000 | RENT | 8.0 | PERSONAL | A | 35000 | 8.90 | 1 | 0.42 | N | 2 |
| 9 | 21 | 10000 | OWN | 6.0 | VENTURE | D | 1600 | 14.74 | 1 | 0.16 | N | 3 |
df.shape[0],df.shape[1]
(32581, 12)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32581 entries, 0 to 32580
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   person_age                  32581 non-null  int64
 1   person_income               32581 non-null  int64
 2   person_home_ownership       32581 non-null  object
 3   person_emp_length           31686 non-null  float64
 4   loan_intent                 32581 non-null  object
 5   loan_grade                  32581 non-null  object
 6   loan_amnt                   32581 non-null  int64
 7   loan_int_rate               29465 non-null  float64
 8   loan_status                 32581 non-null  int64
 9   loan_percent_income         32581 non-null  float64
 10  cb_person_default_on_file   32581 non-null  object
 11  cb_person_cred_hist_length  32581 non-null  int64
dtypes: float64(3), int64(5), object(4)
memory usage: 3.0+ MB
df.describe()
| person_age | person_income | person_emp_length | loan_amnt | loan_int_rate | loan_status | loan_percent_income | cb_person_cred_hist_length | |
|---|---|---|---|---|---|---|---|---|
| count | 32581.000000 | 3.258100e+04 | 31686.000000 | 32581.000000 | 29465.000000 | 32581.000000 | 32581.000000 | 32581.000000 |
| mean | 27.734600 | 6.607485e+04 | 4.789686 | 9589.371106 | 11.011695 | 0.218164 | 0.170203 | 5.804211 |
| std | 6.348078 | 6.198312e+04 | 4.142630 | 6322.086646 | 3.240459 | 0.413006 | 0.106782 | 4.055001 |
| min | 20.000000 | 4.000000e+03 | 0.000000 | 500.000000 | 5.420000 | 0.000000 | 0.000000 | 2.000000 |
| 25% | 23.000000 | 3.850000e+04 | 2.000000 | 5000.000000 | 7.900000 | 0.000000 | 0.090000 | 3.000000 |
| 50% | 26.000000 | 5.500000e+04 | 4.000000 | 8000.000000 | 10.990000 | 0.000000 | 0.150000 | 4.000000 |
| 75% | 30.000000 | 7.920000e+04 | 7.000000 | 12200.000000 | 13.470000 | 0.000000 | 0.230000 | 8.000000 |
| max | 144.000000 | 6.000000e+06 | 123.000000 | 35000.000000 | 23.220000 | 1.000000 | 0.830000 | 30.000000 |
## Checking for Duplicates
dups = df.duplicated()
dups.value_counts()
False    32416
True       165
dtype: int64
## Removing the Duplicates
df.drop_duplicates(inplace=True)
#drop 'interest rate' feature
df.drop(['loan_int_rate'],axis=1,inplace=True)
# separating the numerical/categorical features for preprocessing
ccol=df.select_dtypes(include=["object"]).columns
ncol=df.select_dtypes(include=["int","float"]).columns
print("The number of Categorical columns are:",len(ccol))
print("The number of Numerical columns are:",len(ncol))
The number of Categorical columns are: 4 The number of Numerical columns are: 7
#Printing the different columns with their cardinality (number of unique elements in each column):
print("The NUMERICAL columns are:\n")
for i in ncol:
print("->",i,"-",df[i].nunique())
print("\n---------------------------\n")
print("The CATEGORICAL columns are:\n")
for i in ccol:
print("->",i,"-",df[i].nunique())
The NUMERICAL columns are:

-> person_age - 58
-> person_income - 4295
-> person_emp_length - 36
-> loan_amnt - 753
-> loan_status - 2
-> loan_percent_income - 77
-> cb_person_cred_hist_length - 29

---------------------------

The CATEGORICAL columns are:

-> person_home_ownership - 4
-> loan_intent - 6
-> loan_grade - 7
-> cb_person_default_on_file - 2
#Checking ranges of numerical variables
for col in ncol:
min_value = df[col].min()
max_value = df[col].max()
print(f'Range for {col} : [{min_value} to {max_value}]')
Range for person_age : [20 to 144]
Range for person_income : [4000 to 6000000]
Range for person_emp_length : [0.0 to 123.0]
Range for loan_amnt : [500 to 35000]
Range for loan_status : [0 to 1]
Range for loan_percent_income : [0.0 to 0.83]
Range for cb_person_cred_hist_length : [2 to 30]
# plotting all the categorical features
plt.figure(figsize=(10,7))
for index, col in enumerate(ccol):
plt.subplot(2,3, index+1)
sns.countplot(x=col, hue='loan_status', data=df, palette='Blues')
plt.xticks(rotation=90)
plt.tight_layout()
# Individual frequency plot
plt.figure(figsize=(10,7))
for index, col in enumerate(ccol):
plt.subplot(2,3, index+1)
sns.countplot(x=col, palette='Blues', data= df)
plt.xticks(rotation=90)
plt.tight_layout()
# making a pie chart for loan intent feature
loan_intent_counts = df['loan_intent'].value_counts()
# Create the pie chart using Plotly
fig = px.pie(loan_intent_counts, names=loan_intent_counts.index, values=loan_intent_counts.values,
title='Pie Chart of Loan Intent', color_discrete_sequence=px.colors.sequential.Viridis)
# Show the plot
fig.show()
#making a bar plot for the home-ownership feature
mean_income_by_ownership = df.groupby('person_home_ownership')['person_income'].mean().reset_index()
# Create the bar plot using Plotly
fig = px.bar(mean_income_by_ownership, x='person_home_ownership', y='person_income',
title='Mean Person Income by Home Ownership', color='person_home_ownership',
color_discrete_sequence=px.colors.sequential.Viridis)
# Show the plot
fig.show()
#plotting the distribution of the 'age' feature
plt.figure(figsize=(8, 6))
df['person_age'].plot.hist(bins=10, color='skyblue', edgecolor='black')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Distribution of Person Age')
plt.show()
#plotting a boxplot for the 'person_income' feature (for outlier detection)
plt.figure(figsize=(8, 6))
df.boxplot(column='person_income', vert=False)
plt.xlabel('Income')
plt.title('Boxplot of Person Income')
plt.show()
We consider all borrowers making more than 2M to be outliers.
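As a rough sketch of how such a cutoff could be derived from the data instead of fixed by hand, the classic 1.5 × IQR rule is shown below on a hypothetical income sample (the synthetic `income` series and the rule itself are assumptions for illustration, not what this notebook actually uses):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed income sample standing in for df['person_income']
rng = np.random.default_rng(31415)
income = pd.Series(rng.lognormal(mean=11, sigma=0.6, size=10_000))

# Classic IQR rule: flag values beyond Q3 + 1.5 * IQR as outliers
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr
outliers = income[income > upper]

print(f"Upper fence: {upper:,.0f}")
print(f"Flagged {len(outliers)} of {len(income)} rows as outliers")
```

A fixed domain-driven threshold (like the 2M used here) is also a defensible choice; the IQR fence is simply data-driven and adapts if the income distribution shifts.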
# making a correlation matrix to see the relations between numerical features
correlation_matrix = df.corr(numeric_only=True)
# Create the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
# Add title
plt.title('Correlation Heatmap')
# Show the plot
plt.show()
person_age -> cb_person_cred_hist_length: A strong positive correlation between a person's age and length of credit history may indicate that older people tend to have longer credit histories. This is usually expected because older people have had more time to establish their credit history.
loan_amount -> loan_percent_income: The strong positive correlation between the amount of the loan and the percentage of income allocated to the loan suggests that the amounts of loans granted generally increase as the percentage of income allocated to loan repayment increases. This may indicate that lenders give higher loan amounts to those who spend more of their income on repayment.
loan_amount -> person_income: The strong positive correlation between the amount of the loan and the person's income indicates that people with higher incomes tend to obtain higher loan amounts. This is usually expected, as higher income may be associated with greater repayment capacity.
loan_status -> loan_percent_income: The strong positive correlation between loan status and the percentage of income allocated to the loan suggests that loans with higher income percentages may have higher odds of defaulting.
loan_status -> loan_int_rate: A positive correlation between loan status and the interest rate (a feature dropped earlier in preprocessing) indicates that loans with higher interest rates may have higher chances of defaulting.
person_income -> loan_percent_income: The strong negative correlation between person income and the percentage of income allocated to the loan indicates that people with higher income generally allocate a smaller portion of their income to loan repayment.
# Create the scatter plot using Plotly
fig = px.scatter(df, x='person_age', y='person_income',
title='Scatter Plot of Age vs. Income',
color='person_income',
color_continuous_scale=px.colors.sequential.Viridis)
# Show the plot
fig.show()
# Calculate the sum of 'person_income' for each category of 'person_home_ownership'
income_by_ownership = df.groupby('person_home_ownership')['person_income'].sum().reset_index()
# Get the list of categories and the total income for each category
categories = income_by_ownership['person_home_ownership']
total_income = income_by_ownership['person_income']
plt.figure(figsize=(10, 6))
plt.bar(categories, total_income, color='skyblue')
plt.xlabel('Home Ownership')
plt.ylabel('Total Income')
plt.title('Total Income by Home Ownership')
plt.show()
# Create histograms for each numerical column
plt.figure(figsize=(12, 8))
for i, col in enumerate(ncol[:6], 1): # Limit to 6 columns to fit in the grid
plt.subplot(2, 3, i)
plt.hist(df[col], bins=20, edgecolor='black')
plt.xlabel(col)
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
# Box plot: loan_status vs. loan_percent_income
plt.figure(figsize=(8, 6))
plt.boxplot([df[df['loan_status'] == 0]['loan_percent_income'],
df[df['loan_status'] == 1]['loan_percent_income']],
labels=['Paid', 'Default'], showfliers=False, notch=True, patch_artist=True)
plt.xlabel('Loan Status')
plt.ylabel('Loan Percent Income')
plt.title('Box Plot: Loan Status vs. Loan Percent Income')
plt.show()
#plotting the distribution of risk
sns.countplot(x=df['loan_status'], palette='Oranges')
plt.title('Distribution of Risk')
plt.show()
# pie chart with percentages
df['loan_status'].value_counts().plot(kind='pie', autopct='%1.2f%%', explode=[0,0.1], shadow=True)
<AxesSubplot:ylabel='loan_status'>
The data is highly IMBALANCED. We will use oversampling techniques such as SMOTE, which synthesizes minority-class samples from their k nearest neighbors, to address this issue.
Missing data, or missing values, occur when you don’t have data stored for certain variables or participants. Data can go missing due to incomplete data entry, equipment malfunctions, lost files, and many other reasons.
There are typically 3 types of missing values:
Missing completely at random (MCAR)
Missing at random (MAR)
Missing not at random (MNAR)
Problems: Missing data are problematic because, depending on the type, they can sometimes cause sampling bias. This means your results may not be generalizable outside of your study because your data come from an unrepresentative sample.
df.isnull().any()
person_age                    False
person_income                 False
person_home_ownership         False
person_emp_length              True
loan_intent                   False
loan_grade                    False
loan_amnt                     False
loan_status                   False
loan_percent_income           False
cb_person_default_on_file     False
cb_person_cred_hist_length    False
dtype: bool
df.isna().sum()
person_age                      0
person_income                   0
person_home_ownership           0
person_emp_length             887
loan_intent                     0
loan_grade                      0
loan_amnt                       0
loan_status                     0
loan_percent_income             0
cb_person_default_on_file       0
cb_person_cred_hist_length      0
dtype: int64
msno.bar(df)
<AxesSubplot:>
NOTE: EVERY PREPROCESSING TECHNIQUE IS DONE ONLY ON THE TRAIN SET. SO SPLITTING IS MANDATORY BEFORE OUTLIER REMOVAL, MISSING VALUES HANDLING, OVERSAMPLING, ETC...
# we split the data to train / test parts
X_train, X_test, y_train, y_test = train_test_split(df.drop('loan_status', axis=1), df['loan_status'],
random_state=0, test_size=0.2, stratify=df['loan_status'],
shuffle=True)
#print the number of unique values:
for col in X_train:
print(col, '--->', X_train[col].nunique())
#if a feature has fewer than 20 unique values, we also show its percentage distribution
if X_train[col].nunique()<20:
print(X_train[col].value_counts(normalize=True)*100)
print()
person_age ---> 58
person_income ---> 3680
person_home_ownership ---> 4
RENT        50.320068
MORTGAGE    41.439149
OWN          7.916859
OTHER        0.323924
Name: person_home_ownership, dtype: float64

person_emp_length ---> 36
loan_intent ---> 6
EDUCATION            19.809502
MEDICAL              18.787598
VENTURE              17.542033
PERSONAL             16.878760
DEBTCONSOLIDATION    15.968687
HOMEIMPROVEMENT      11.013420
Name: loan_intent, dtype: float64

loan_grade ---> 7
A    32.932284
B    32.126330
C    19.902052
D    11.121394
E     3.004010
F     0.732685
G     0.181243
Name: loan_grade, dtype: float64

loan_amnt ---> 710
loan_percent_income ---> 75
cb_person_default_on_file ---> 2
N    82.392411
Y    17.607589
Name: cb_person_default_on_file, dtype: float64

cb_person_cred_hist_length ---> 29
X_train.loc[X_train['person_age']>=80, :]
X_train = X_train.loc[X_train['person_age']<=80, :]
X_train.loc[X_train['person_emp_length']>=60, :]
X_train = X_train.loc[X_train['person_emp_length']<60, :]
X_train.loc[X_train['person_income']>=2000000, :]
X_train = X_train.loc[X_train['person_income']<=2000000, :]
# to keep the same rows between x_train and y_train (deleting from y_train the row that were deleted from x_train)
y_train = y_train[X_train.index]
y_train.shape
(25196,)
For the numerical features:
1- Iterative Imputer - to handle missing values
2- Standard Scaler - to keep the features on the same scale
For the categorical features:
1- One-Hot Encoder - to encode each category for model interpretability
#Create the main pipeline for preprocessing numerical variables:
numerical_pipeline = Pipeline([
('imputer', IterativeImputer()), # Impute missing values using iterative imputer
('scaler', StandardScaler()) # Scale numerical features
])
#Create the pipeline for preprocessing categorical variables:
categorical_pipeline = Pipeline([
('encoder', OneHotEncoder()) # One-hot encode categorical features
])
# Collect the lists of numerical and categorical feature names
numerical_features = X_train.select_dtypes(include='number').columns.tolist()
categorical_features = X_train.select_dtypes(include='object').columns.tolist()
preprocessor = ColumnTransformer([
('numerical', numerical_pipeline, numerical_features),
('categorical', categorical_pipeline, categorical_features)
])
#Fit and transform the main pipeline on the training data:
X_train_preprocessed = preprocessor.fit_transform(X_train)
def fit_preprocessing_pipeline(X_train):
    return preprocessor.fit(X_train)
#saving the pipeline preprocessor
joblib.dump(preprocessor, 'preprocessing_pipeline.pkl')
['preprocessing_pipeline.pkl']
#fit smote to the data
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train_preprocessed, y_train)
# Replace numeric class labels with words
class_labels_mapping = {0: 'paid', 1: 'default'}
y_train_mapped = y_train.map(class_labels_mapping)
y_train_balanced_mapped = pd.Series(y_train_balanced).map(class_labels_mapping)
# Create bar plot for class distribution before SMOTE with words
plt.figure(figsize=(6, 4))
y_train_mapped.value_counts().plot(kind='bar')
plt.xlabel('loan_status')
plt.ylabel('Count')
plt.title('Class Distribution Before SMOTE')
plt.xticks(rotation=0)
plt.show()
# Create bar plot for class distribution after SMOTE with words
plt.figure(figsize=(6, 4))
y_train_balanced_mapped.value_counts().plot(kind='bar')
plt.xlabel('loan_status')
plt.ylabel('Count')
plt.title('Class Distribution After SMOTE')
plt.xticks(rotation=0)
plt.show()
# preprocessing the test set (transform only, using the pipeline already fitted on the train set)
X_test_processed = preprocessor.transform(X_test)
# Define the models and their respective hyperparameter grids
models = {
'XGBoost': (XGBClassifier(), {'n_estimators': [i*100 for i in range(1, 4)], 'max_depth': [6,8,10], 'learning_rate': [0.01, 0.05, 0.1]}),
'Logistic Regression': (LogisticRegression(), {'C': [0.01, 0.1, 1, 10]}),
#'SVM': (SVC(), {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}),
'Neural Network': (MLPClassifier(), {'hidden_layer_sizes': [(100,), (100, 50)], 'activation': ['relu', 'tanh']}),
'Random Forest': (RandomForestClassifier(random_state=0, class_weight='balanced'), {'n_estimators': [100, 200, 300], 'max_depth': [None, 5, 10]}),
}
# Define a dictionary to store the evaluation metrics for each model
evaluation_metrics = {
'Model': [],
'Cross-Val Score': [],
'Accuracy': [],
'F1 Score': [],
'MSRE': []
}
# Create a dictionary to store the best models
best_models = {}
# Perform cross-validation and hyperparameter tuning for each model
for model_name, (model, param_grid) in models.items():
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train_balanced, y_train_balanced)
print(f"Model: {model_name}")
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.3f}\n")
# Append the evaluation metrics to the dictionary
evaluation_metrics['Model'].append(model_name)
evaluation_metrics['Cross-Val Score'].append(grid_search.best_score_)
# Get the best model
best_model = grid_search.best_estimator_
# Store the best model in the dictionary
best_models[model_name] = best_model
# Predict the test set using the best model
y_pred = best_model.predict(X_test_processed)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
evaluation_metrics['Accuracy'].append(accuracy)
# Calculate F1 score
f1 = f1_score(y_test, y_pred, average='weighted')
evaluation_metrics['F1 Score'].append(f1)
# Calculate MSRE
msre = mean_squared_error(y_test, y_pred)
evaluation_metrics['MSRE'].append(msre)
# Convert the dictionary to a Pandas DataFrame for easy plotting
metrics_df = pd.DataFrame(evaluation_metrics)
# Plot the evaluation metrics
plt.figure(figsize=(10, 6))
plt.bar(metrics_df['Model'], metrics_df['Cross-Val Score'], label='Cross-Val Score', alpha=0.7)
plt.bar(metrics_df['Model'], metrics_df['Accuracy'], label='Accuracy', alpha=0.7)
plt.bar(metrics_df['Model'], metrics_df['F1 Score'], label='F1 Score', alpha=0.7)
plt.bar(metrics_df['Model'], metrics_df['MSRE'], label='MSRE', alpha=0.7)
plt.xticks(rotation=45)
plt.xlabel('Model')
plt.ylabel('Score')
plt.title('Model Evaluation Metrics')
plt.legend()
plt.tight_layout()
plt.show()
# After the loop, training is complete
print("Training completed!")
Model: XGBoost
Best parameters: {'learning_rate': 0.1, 'max_depth': 8, 'n_estimators': 300}
Best cross-validation score: 0.955
Model: Logistic Regression
Best parameters: {'C': 10}
Best cross-validation score: 0.801
Model: Neural Network
Best parameters: {'activation': 'tanh', 'hidden_layer_sizes': (100, 50)}
Best cross-validation score: 0.908
Model: Random Forest
Best parameters: {'max_depth': None, 'n_estimators': 300}
Best cross-validation score: 0.948
Training completed!
evaluation_metrics = pd.DataFrame(evaluation_metrics)
metrics_to_plot = ['Cross-Val Score', 'Accuracy', 'F1 Score', 'MSRE']
for metric in metrics_to_plot:
plt.figure(figsize=(8, 6))
plt.bar(evaluation_metrics['Model'], evaluation_metrics[metric], alpha=0.7)
plt.xticks(rotation=45)
plt.xlabel('Model')
plt.ylabel('Score')
plt.title(f'Model Evaluation Metric: {metric}')
plt.tight_layout()
plt.show()
models = {
'XGBoost': XGBClassifier(learning_rate = 0.1, max_depth= 8, n_estimators = 300),
'Logistic Regression': LogisticRegression(C = 10),
#'SVM': (SVC(), {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}),
'Neural Network': MLPClassifier(activation = 'tanh', hidden_layer_sizes = (100, 50)),
'Random Forest': RandomForestClassifier(random_state=0, class_weight= 'balanced',max_depth = None, n_estimators = 300)
}
# Create a function to plot the learning curve
def plot_learning_curve(model, X, y):
train_sizes, train_scores, test_scores = learning_curve(model, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring='accuracy')
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
plt.figure(figsize=(8, 6))
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1, color='blue')
plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, alpha=0.1, color='orange')
plt.plot(train_sizes, train_mean, 'o-', color='blue', label='Training Score')
plt.plot(train_sizes, test_mean, 'o-', color='orange', label='Cross-Validation Score')
plt.xlabel('Training Examples')
plt.ylabel('Score')
plt.title(f'Learning Curve for {model.__class__.__name__}')
plt.legend()
plt.grid(True)
plt.show()
# Loop over the models and plot the learning curve for each
for model_name, model in models.items():
plot_learning_curve(model, X_train_balanced, y_train_balanced)
# Create a dictionary to store confusion matrices for each model
conf_matrices = {}
# Loop over the models and calculate the confusion matrix for each
for model_name, model in best_models.items():
y_pred = model.predict(X_test_processed)
conf_matrix = confusion_matrix(y_test, y_pred)
conf_matrices[model_name] = conf_matrix
# Plot the confusion matrices
plt.figure(figsize=(12, 8))
for i, (model_name, conf_matrix) in enumerate(conf_matrices.items()):
plt.subplot(2, 2, i + 1)
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False, square=True)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title(f'Confusion Matrix - {model_name}')
plt.tight_layout()
plt.show()
# Save the best model (XGBoost scored highest in cross-validation):
joblib.dump(best_models['XGBoost'], 'best_model.pkl')
['best_model.pkl']
In this project I was exposed to a lot of concepts, such as:
--> building a pipeline
--> hyperparameter tuning
--> evaluating models
--> building my first streamlit application
--> deploying it.
In conclusion, this credit risk analysis project demonstrates the power of data science and machine learning in the financial industry. This web app can serve as a valuable tool for financial institutions to assess credit risk, make informed lending decisions, and mitigate potential losses.
However, as with any data science project, there are a few points to keep in mind:
Continuous monitoring: Credit risk is a dynamic domain, and models need regular updates to adapt to changing economic conditions and borrower behaviors.
Model Robustness: Although the achieved accuracy is excellent, it's essential to test the model's robustness on a wider range of scenarios and data distributions.
Ethical Considerations: Credit risk models must be fair and unbiased. Continuously monitor for any potential bias and ensure fairness in lending decisions.
Model Deployment: Deploying a machine learning model in production involves careful considerations, such as scalability, security, and version control.